Evaluating Lemmatization Models for Machine-Assisted Corpus-Dictionary Linkage

Authors

  • Kevin Black
  • Eric K. Ringger
  • Paul Felt
  • Kevin D. Seppi
  • Kristian Heal
  • Deryle W. Lonsdale
Abstract

The task of corpus-dictionary linkage (CDL) is to annotate each word in a corpus with a link to an appropriate dictionary entry that documents the sense and usage of the word. Corpus-dictionary linked resources include concordances, dictionaries with word usage examples, and corpora annotated with lemmas or word senses. Such CDL resources are essential for many tasks including assisting language learners, linguistic research, philology, and translation. Lemmatization is a common approximation to automating corpus-dictionary linkage, where lemmas stand in for the headwords of an actual dictionary. In our machine-assisted CDL system design, data-driven lemmatization models provide machine assistance to human annotators performing the actual CDL task. Assistance is provided in the form of pre-annotations that will reduce the costs of CDL annotation. In this work we adapt the discriminative string transducer DirecTL+ to perform lemmatization for classical Syriac, a low-resource language. We compare the accuracy of DirecTL+ with the Morfette discriminative lemmatizer. DirecTL+ achieves 96.92% overall accuracy, an improvement of 0.86% over Morfette but at the cost of a longer time to train the model. Error analysis on the models provides guidance on how to apply these models in a machine assistance setting for corpus-dictionary linkage.
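The accuracy comparison reported in the abstract can be illustrated with a minimal sketch of token-level lemmatization accuracy. The function name `lemma_accuracy`, the toy English lemmas, and the two model outputs below are all hypothetical; they are not the paper's data or either system's actual output. The abstract does not state whether the 0.86% improvement is absolute or relative, so the sketch simply computes an absolute difference.

```python
from typing import Sequence


def lemma_accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Fraction of tokens whose predicted lemma exactly matches the gold lemma."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same number of tokens")
    if not gold:
        raise ValueError("cannot compute accuracy on an empty sequence")
    correct = sum(p == g for p, g in zip(predicted, gold))
    return correct / len(gold)


# Hypothetical toy data: gold lemmas and the outputs of two lemmatizers.
gold = ["go", "be", "house", "run"]
model_a = ["go", "be", "house", "ran"]   # one error
model_b = ["go", "is", "house", "ran"]   # two errors

acc_a = lemma_accuracy(model_a, gold)
acc_b = lemma_accuracy(model_b, gold)

# Absolute difference in accuracy, expressed in percentage points.
improvement_points = (acc_a - acc_b) * 100
print(f"model A: {acc_a:.2%}, model B: {acc_b:.2%}, "
      f"difference: {improvement_points:.2f} points")
```

In a machine-assistance setting, the same per-token comparison could also be broken down by error type to decide which pre-annotations are worth showing to annotators.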

Similar resources

Joint Lemmatization and Morphological Tagging with Lemming

We present LEMMING, a modular log-linear model that jointly models lemmatization and tagging and supports the integration of arbitrary global features. It is trainable on corpora annotated with gold standard tags and lemmata and does not rely on morphological dictionaries or analyzers. LEMMING sets the new state of the art in token-based statistical lemmatization on six languages; e.g., for Czec...

Comparison of Different Lemmatization Approaches through the Means of Information Retrieval Performance

This paper presents a quantitative performance analysis of two different approaches to the lemmatization of Czech text data. The first is based on a manually prepared dictionary of lemmas and a set of derivation rules, while the second is based on automatic inference of the dictionary and the rules from training data. The comparison is done by evaluating the mean Generalized Average Prec...

Enhancing Lemmatization for Mongolian and its Application to Statistical Machine Translation

Lemmatization is crucial in natural language processing and information retrieval, especially for highly inflected languages such as Finnish and Mongolian. The state-of-the-art method of lemmatization for Mongolian does not need a noun dictionary and is scalable, but errors of this method are mainly caused by problems related to part of speech (POS) information. To resolve this problem, we inte...

Machine Learning of Morphosyntactic Structure: Lemmatizing Unknown Slovene Words

Automatic lemmatization is a core application for many language processing tasks. In inflectionally rich languages, such as Slovene, assigning the correct lemma (base form) to each word in a running text is not trivial, since for instance, nouns inflect for number and case, with a complex configuration of endings and stem modifications. The problem is especially difficult for unknown words, sin...

A Dictionary- and Corpus-Independent Statistical Lemmatizer for Information Retrieval in Low Resource Languages

We present a dictionary- and corpus-independent statistical lemmatizer StaLe that deals with the out-of-vocabulary (OOV) problem of dictionary-based lemmatization by generating candidate lemmas for any inflected word forms. StaLe can be applied with little effort to languages lacking linguistic resources. We show the performance of StaLe both in lemmatization tasks alone and as a component in an ...

Publication date: 2014